Large-scale systems with arrays of solid-state disks (SSDs) have become increasingly common in many computing segments. To make such systems resilient, we can adopt erasure coding, such as Reed-Solomon (RS) codes, as an alternative to replication, because erasure coding offers a significantly lower storage cost than replication. To understand the impact of erasure coding on system performance and on other system aspects such as CPU utilization and network traffic, we build a storage cluster consisting of approximately one hundred processor cores and more than fifty high-performance SSDs, and evaluate the cluster with a popular open-source distributed parallel file system, Ceph. We then analyze the behavior of systems that adopt erasure coding, compared with that of systems using replication, from the following five viewpoints: (1) storage system I/O performance; (2) computing and software overheads; (3) I/O amplification; (4) network traffic among storage nodes; and (5) the impact of physical data layout on the performance of RS-coded SSD arrays. For all these analyses, we examine two representative RS configurations, which are used by the Google and Facebook file systems, and compare them with triple replication, which a typical parallel file system employs as its default fault-tolerance mechanism. Lastly, we collect 54 block-level traces from the cluster and make them available to other researchers.